{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5Rsndc93i52B"
   },
   "source": [
    "Authors: Andreas Haupt, Jannis Kück, Alexander Quispe, Anzony Quispe, Vasilis Syrgkanis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fFIsBLlF7YFv"
   },
   "source": [
    "# Machine Learning Estimators for Wage Prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ssGRQl-d7U9O"
   },
   "source": [
    "We illustrate how to predict an outcome variable $Y$ in a high-dimensional setting, where the number of covariates $p$ is large in relation to the sample size $n$. So far we have used linear prediction rules, e.g. Lasso regression, for estimation.\n",
    "Now, we also consider nonlinear prediction rules including tree-based methods."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yYmcd6mN7VCV"
   },
   "source": [
    "## Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1n1-LWsu53N6"
   },
   "source": [
    "Again, we consider data from the U.S. March Supplement of the Current Population Survey (CPS) in 2015.\n",
    "The preproccessed sample consists of $5150$ never-married individuals."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1or5aUNr7yTv"
   },
   "source": [
    "Set the following file_directory to a place where you downloaded https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "57TFoHNk8BIg",
    "outputId": "09a65cae-699a-41da-9605-ab8bab757fdc"
   },
   "outputs": [],
   "source": [
    "# Import relevant packages\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.model_selection import cross_val_score, KFold, GridSearchCV\n",
    "from sklearn.metrics import mean_squared_error, r2_score\n",
    "from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV, LinearRegression, Ridge, Lasso\n",
    "import patsy\n",
    "import warnings\n",
    "from sklearn.base import BaseEstimator, clone\n",
    "import statsmodels.api as sm\n",
    "warnings.simplefilter('ignore')\n",
    "np.random.seed(1234)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "file = \"https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv\"\n",
    "data = pd.read_csv(file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "fUxf95F1B2EE",
    "outputId": "c4931ab3-06d8-4f5c-d640-950932fef064"
   },
   "outputs": [],
   "source": [
    "y = np.log(data['wage']).values\n",
    "Z = data.drop(['wage', 'lwage'], axis=1)\n",
    "Z.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "v2E-oxA8DQqH"
   },
   "source": [
    "The following figure shows the weekly wage distribution from the US survey data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 313
    },
    "id": "4QJc_mnlB2KN",
    "outputId": "b9b4fe0c-c173-4f84-ec40-0de80bcbcc35"
   },
   "outputs": [],
   "source": [
    "plt.hist(data.wage , bins = np.arange(0, 350, 20) )\n",
    "plt.xlabel('hourly wage')\n",
    "plt.ylabel('Frequency')\n",
    "plt.title( 'Empirical wage distribution from the US survey data' )\n",
    "plt.ylim((0, 3000))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TvMhq0RNDL43"
   },
   "source": [
    "Wages show a high degree of skewness. Hence, wages are transformed in almost all studies by\n",
    "the logarithm."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xBL1FmvgDV3f"
   },
   "source": [
    "## Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nDbY6BAuDYVD"
   },
   "source": [
    "Due to the skewness of the data, we are considering log wages which leads to the following regression model\n",
    "\n",
    "$$\\log(\\operatorname{wage}) = g(Z) + \\epsilon.$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jNACoPwVDdpK"
   },
   "source": [
    "We will estimate the two sets of prediction rules: Linear and Nonlinear Models.\n",
    "In linear models, we estimate the prediction rule of the form\n",
    "\n",
    "$$\\hat g(Z) = \\hat \\beta'X.$$\n",
    "Again, we generate $X$ in two ways:\n",
    " \n",
    "1. Basic Model:   $X$ consists of a set of raw regressors (e.g. gender, experience, education indicators, regional indicators).\n",
    "\n",
    "\n",
    "2. Flexible Model:  $X$ consists of all raw regressors from the basic model plus occupation and industry indicators, transformations (e.g., $\\operatorname{exp}^2$ and $\\operatorname{exp}^3$) and additional two-way interactions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "iDM3apFhDlgf"
   },
   "source": [
    "To evaluate the out-of-sample performance, we split the data first."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PPt0pZtgOMSn"
   },
   "source": [
    "We are starting by running a simple OLS regression. We fit the basic and flexible model to our training data by running an ols regression and compute the R-squared on the test sample"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Low dimensional specification"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Zbase = patsy.dmatrix('0 + sex + exp1 + shs + hsg+ scl + clg + mw + so + we + C(occ2) + C(ind2)',\n",
    "                      Z, return_type='dataframe').values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(Zbase, y, test_size=0.25, random_state=123)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lr_base = LinearRegression().fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's calculate R-squared on the test set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r2_base = 1 - np.mean((y_test - lr_base.predict(X_test))**2) / np.var(y_test)\n",
    "print(f'{r2_base:.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In fact `sklearn` provides an implementation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f'{r2_score(y_test, lr_base.predict(X_test)):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since out of sample performance can be varying for different train-test splits, it is more stable to look at average performance across multiple splits, using K-fold cross validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(LinearRegression(), Zbase, y, scoring='r2', cv=cv)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### High-dimensional specification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jX09oKqhRJgz"
   },
   "source": [
    "We repeat the same procedure for the flexible model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Zflex = patsy.dmatrix('0 + sex + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we)',\n",
    "                      Z, return_type='dataframe').values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(Zflex, y, test_size = 0.25, random_state = 123)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lr_flex = LinearRegression().fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f'{r2_score(y_test, lr_flex.predict(X_test)):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, OLS can be quite un-stable for such high-dimensional problems and it really matters what solution is being returned among the multitude of solutions to the least squares objectives (which are non-unique in high-dimensional settings). For instance, we see that the `sklearn` implementation returns a numerically un-stable solution whose error blows up in some cases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(LinearRegression(), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`sklearn`'s implementation uses the least squares solver from `scipy.linalg.lstsq`. If for instance we instead use the pseudo-inverse based implementation we get a different result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class MyOLS(BaseEstimator):\n",
    "\n",
    "    def fit(self, X, y):\n",
    "        X = np.hstack([np.ones((X.shape[0], 1)), X])\n",
    "        CXX = (X.T @ X) / X.shape[0]\n",
    "        CXy = (X.T @ y) / X.shape[0]\n",
    "        self.coef_ = np.linalg.pinv(CXX) @ CXy \n",
    "        return self\n",
    "\n",
    "    def predict(self, X):\n",
    "        X = np.hstack([np.ones((X.shape[0], 1)), X])\n",
    "        return X @ self.coef_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(MyOLS(), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This also recovers the solution provided by `statsmodels.api.OLS`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class StatsModelsOLS(BaseEstimator):\n",
    "\n",
    "    def fit(self, X, y):\n",
    "        X = np.hstack([np.ones((X.shape[0], 1)), X])\n",
    "        self.ols_ = sm.OLS(y, X).fit()\n",
    "        return self\n",
    "\n",
    "    def predict(self, X):\n",
    "        X = np.hstack([np.ones((X.shape[0], 1)), X])\n",
    "        return self.ols_.predict(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(StatsModelsOLS(), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also choose different solvers by using `sklearn.linear_model.Ridge` which allows for no penalty and a multitude of solvers. We see that the `lsqr` solver is more stable than solvers based on singular value decompositions of the covariance matrix $E_n[X X']$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(Ridge(alpha=0.0, solver='lsqr'), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(Ridge(alpha=0.0, solver='cholesky'), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(Ridge(alpha=0.0, solver='svd'), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Penalized Regressions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-4QL8R2OUbT_"
   },
   "source": [
    "We observe that ols regression works better for the basic model with smaller $p/n$ ratio. We are proceeding by running penalized regressions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we try a pure `l1` penalty, tuned using cross-validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(LassoCV(cv=cv), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Oops! For penalized regressions it is important that our features have the same standard deviation, so that we are symmetrically penalizing them"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "Zflex = StandardScaler().fit_transform(Zflex)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's try again!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(LassoCV(cv=cv), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we try a pure `l2` penalty, tuned using cross-validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(RidgeCV(cv=cv), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we try an equal combination of the two penalties, with the overall weight tuned using cross validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(ElasticNetCV(cv=cv), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also try a variant of the `l1` penalty, where the weight is chosen based on theoretical derivations. This is a based on a Python implementation that tries to replicate the main function of hdm r-package. It was made by [Max Huppertz](https://maxhuppertz.github.io/code/). His library is this [repository](https://github.com/maxhuppertz/hdmpy). Download its repository and copy this folder to your site-packages folder. In my case it is located here ***C:\\Python\\Python38\\Lib\\site-packages*** . It requires the multiprocess package ***pip install multiprocess***."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We wrap the package so that it has the familiar sklearn API\n",
    "import hdmpy\n",
    "\n",
    "class RLasso(BaseEstimator):\n",
    "    \n",
    "    def __init__(self, *, post=True):\n",
    "        self.post = post\n",
    "    \n",
    "    def fit(self, X, y):\n",
    "        self.rlasso_ = hdmpy.rlasso(X, y, post=self.post)\n",
    "        return self\n",
    "    \n",
    "    def predict(self, X):\n",
    "        return X @ np.array(self.rlasso_.est['beta']).flatten() + self.rlasso_.est['intercept'].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(RLasso(), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we try the combination of a sparse and a dense coefficient using the LAVA method"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We construct an sklearn API estimator that implements the LAVA method\n",
    "\n",
    "class Lava(BaseEstimator):\n",
    "    \n",
    "    def __init__(self, *, alpha2=1, iterations=3):\n",
    "        self.alpha2 = alpha2\n",
    "        self.iterations = iterations\n",
    "\n",
    "    def fit(self, X, y):\n",
    "        lasso = RLasso(post=False).fit(X, y)\n",
    "        ridge = Ridge(self.alpha2).fit(X, y - lasso.predict(X).flatten())\n",
    "\n",
    "        for _ in range(self.iterations - 1):\n",
    "            lasso = lasso.fit(X, y - ridge.predict(X))\n",
    "            ridge = ridge.fit(X, y - lasso.predict(X).flatten())\n",
    "\n",
    "        self.lasso_ = lasso\n",
    "        self.ridge_ = ridge\n",
    "        return self\n",
    "\n",
    "    def predict(self, X):\n",
    "        return self.lasso_.predict(X) + self.ridge_.predict(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv = KFold(n_splits=5, shuffle=True, random_state=123)\n",
    "rsquares = cross_val_score(Lava(alpha2=20), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)\n",
    "print(f'{np.mean(rsquares):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We find that for this dataset the low dimensional OLS was the best among all specifications. The high-dimensional approaches did not manage to increase the explainability power of the outcome."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "name": "PM2A_prediction",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
